Abstract

Suppose that you are navigating in "hyperspace" and you have reached a web page with several
outgoing links you could choose to follow. Which link should you choose in such an online
scenario? One extreme case is that you know exactly where you are heading and you have no
problem in choosing a link to follow. In all other cases, when you are not sure where the
information you require resides, you will initiate a navigation (or "surfing") session. This
involves pruning (or discounting) some of the links and following one of the others, where more
pruning is likely to happen the deeper you navigate. In terms of decision making, the utility of
navigation diminishes with distance until finally the utility drops to zero and the session is
terminated. Under this model of navigation, we call the number of nodes that are available after
pruning, for browsing within a session, the potential gain of the starting web page. Thus the
parameters that affect the potential gain are the local branching factor with respect to the
starting web page and the discount factor.






We first consider the case when the discounting factor is geometric. We show that the
distribution of the effective number of links that the user can follow at each navigation
step after pruning, i.e. the number of nodes added to the potential gain at that step,
is given by the erf function, which is related to the probability density function for the
Normal distribution. We derive an approximation to the potential gain of a web page
and show that this is numerically a very accurate estimate. We also obtain lower and
upper bounds on the potential gain. We then consider a harmonic discounting factor and
show that, in this case, the potential gain at each step is closely related to the probability
density function for the Poisson distribution.

The potential gain has been applied to web navigation where, given no other information,
it helps the user to choose a good starting point for initiating a surfing session.
Another application is in social network analysis, where the potential gain could provide
a novel measure of centrality.

1 Introduction 

In order to find information on the World-Wide Web, "surfers" often adopt the following
two-stage strategy [Lev05]. First they submit their query to a global web search engine, such
as Google or Yahoo, which directs them to the home page of the subdomain within the web
site that is likely to contain the information they are looking for. Then they navigate within
this web site by following hyperlinks until they either find the information they are seeking, or
they restart their search by reformulating their original query and then repeating the process.
In some cases users simply give up their search task when they lose the context in which
they were browsing and are unsure how to proceed in order to satisfy their original goals.
This phenomenon is known as the navigation problem [LL02] or colloquially as "getting lost
in hyperspace" [Nie00].



Although, as far as we know, global web search engines attach higher weights to home
pages than to other pages, they do not have a general mechanism to take into consideration
the navigation potential of web pages. Our aim in this paper is to investigate the problem of
finding "good" starting points for web navigation that are independent of the user's query.
Once we have available such a measure, we can weight this information into the user's query
in order to find "good" points for starting navigation given the actual query. Hereafter we
shall refer to the measure of navigation potential of a web page as its potential gain. We
note that the application that initially led us to look into the potential gain is the search and
navigation engine we have developed for semi-automating user navigation within web sites
[WL03], but we believe that this notion has wider applicability within the general context of
web search tools [LW04].

In view of the above, we would like to choose a web page (or more technically a URL, 
i.e. a Uniform Resource Locator) from which to start navigation that in some well-defined 
sense maximises the potential of the user to realise his/her "surfing" goal. The only a priori 
information that may be available is partial knowledge of the topology of the web, i.e. the 
set of URLs which are reachable from a given starting URL. This information amounts to 
some knowledge about the density of web pages in the neighbourhood of the starting URL. 
Essentially, if this neighbourhood is denser, i.e. we can potentially reach many URLs in a 
short distance, then we consider the potential gain, or utility, of this URL to be high. For 
example, the home page of a web site is normally a "good" starting URL for navigation 
precisely for the reason that there is a wealth of information reachable from it. 

Assuming that we are navigating within the web graph, the potential gain of a starting 
URL is, informally, the number of URLs that can be reached from the starting point, where at 
each step the number of outgoing links is successively discounted depending on the distance 
from the starting point. We investigate two discounting functions, geometric and harmonic. 
For geometric discounting we show that the potential gain values follow a Normal distribution 
with respect to the distance from the starting point, while for harmonic discounting the 
distribution is Poisson. Moreover, for geometric discounting, we derive an approximation 
to the potential gain, which is numerically very accurate, and also derive lower and upper 
bounds. 

The rest of the paper is organised as follows. In Section 2 we give a formal definition of
the potential gain of a web page, and derive bounds on it, assuming a geometric discounting
factor. In Section 3 we provide a brief computational analysis of the distribution of the
potential gain values and demonstrate the tightness of the derived bounds. In Section 4
we investigate the potential gain when utilising a harmonic discounting factor. Finally, in
Section 5 we give our concluding remarks. For graph-theoretic concepts and background we
refer the reader to [BH90].



2 The Potential Gain of a Web Page 

Let us assume that the user is in the midst of a navigation session having started from a
certain URL, say U. The user is browsing a web page and has to decide whether to follow one
of the links on the page or to terminate the session. We make the assumption that the utility of
browsing a web page diminishes with the distance of that page from the starting URL U. This
assumption is consistent with experiments carried out on real web data [HPPL98, LBL01].

So a user browsing a page at distance d from U will prune from the links actually present
those considered to be not worth following; and for larger d a larger proportion of the links
will be pruned. For this purpose we define the discount factor $\delta$, with $0 < \delta < 1$, and
assume that, at distance d, the user will only inspect the fraction $\delta^d$ of the currently
available links, prior to following one of these. Some of the links may be pruned because they
lead to pages that the user has already inspected, whilst others may be pruned as a result of
filtering, for example, by picking up the "scent of information" [Pir97].

We model the web graph as a directed graph $G = (\mathcal{U}, \mathcal{E})$ having a set of nodes
(or URLs) $\mathcal{U}$ and a set of arcs (or links) $\mathcal{E}$. For convenience we will assume
that G is strongly connected, although this restriction could be relaxed. To formalise our model
of the user, we need to estimate the local branching factor $\beta$ of G with respect to a given
starting URL U: this is a local estimate of the number of outlinks per node. For this purpose we
define an integer parameter $\Delta$, called clicks, where $\Delta \ge 1$; this denotes the mean
number of clicks (rounded down) a user makes during a navigation session, i.e. links she follows
before terminating her session. (See [HPPL98, LBL01] for an analysis of the distribution of
clicks.) The local branching factor gives an estimate of how many links, on average, the user has
to choose from, and clicks gives an estimate of the number of links, on average, she will traverse
during a navigation session. Given $\Delta$, let reach(U) be the subgraph of G induced by
traversing G in a breadth-first manner to depth $\Delta$, starting from U. We then define $\beta$
as the average branching factor (i.e. out-degree) of the nodes in reach(U). (We note that, in our
breadth-first traversal, we do not keep a record of the nodes visited, so we may visit a node more
than once.) In an online scenario an estimate of $\beta$ may be obtained by sampling in the
vicinity of U, or from preprocessed log data of previous surfers who have visited U. Suppose we
have determined the structure of the subgraph of reach(U) obtained by searching to some depth
$\Delta^* \le \Delta$. We can then compute $\beta_d$, the average branching factor of the nodes at
depth d, $0 \le d < \Delta^*$, as the arithmetic mean of the branching factors of the nodes at
depth d. In order to maintain consistency with the total number of nodes at level $\Delta^*$, we
suggest using the geometric mean of the $\beta_d$, $0 \le d < \Delta^*$, as an estimate of $\beta$.
An estimate of $\delta$ can then be obtained from $\beta$ and $\Delta$, as we show later.
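The estimate of the local branching factor described above can be sketched in Python. This is only an illustration under stated assumptions: the graph is taken to be a dictionary mapping each node to a list of its successors, and the name `estimate_beta` is hypothetical rather than taken from any system described in this paper.

```python
from math import prod


def estimate_beta(graph, start, depth):
    """Geometric mean of the per-depth average out-degrees.

    graph : dict mapping a node to the list of nodes it links to
            (an assumed representation, chosen for this sketch).
    start : the starting URL U.
    depth : the search depth (Delta* in the text), at least 1.

    As in the text, visited nodes are not recorded, so a node that is
    reachable along several paths is counted once per path.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    level = [start]
    averages = []  # averages[d] is the mean out-degree at depth d
    for _ in range(depth):
        if not level:  # dead end: no nodes left at this depth
            break
        next_level = [v for u in level for v in graph.get(u, [])]
        averages.append(len(next_level) / len(level))
        level = next_level
    # The product of the averages equals the number of nodes at the final
    # level, so the geometric mean is consistent with that count.
    return prod(averages) ** (1.0 / len(averages))
```

For a three-node graph such as `{"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}`, searching from `"a"` to depth 2 gives per-depth averages 2 and 1.5, whose product 3 is the number of nodes reached at level 2.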

Hence, given $\beta$, the effective branching factor at depth i is $\beta\delta^i$, and the
potential number of available nodes at this depth is approximately

$$\beta\delta^0 \cdot \beta\delta^1 \cdots \beta\delta^{i-1} = \beta^i \delta^{i(i-1)/2}. \tag{1}$$

The total potential gain of U, denoted by PG(U), is simply the total number of available
nodes at all depths, i.e.

$$PG(U) = \sum_{i=0}^{\infty} \beta^i \delta^{i(i-1)/2}. \tag{2}$$
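The sum defining the potential gain can be evaluated directly, since each term is obtained from the previous one by multiplying by the effective branching factor at that depth. The following Python sketch is illustrative only; the function name `potential_gain` and the stopping tolerance `tol` are assumptions made for this example.

```python
def potential_gain(beta, delta, depth=None, tol=1e-12):
    """Sum beta**i * delta**(i*(i-1)/2) for i = 0, 1, 2, ...

    If depth is given, the sum stops at i = depth; otherwise it stops
    once a term falls below tol (requires 0 < delta < 1).
    """
    total = 0.0
    term = 1.0  # the term for i = 0
    i = 0
    while True:
        total += term
        if depth is not None and i >= depth:
            break
        # The term for depth i + 1 is the term for depth i multiplied by
        # the effective branching factor beta * delta**i.
        term *= beta * delta ** i
        i += 1
        if depth is None and term < tol:
            break
    return total
```

Because the exponent of the discount factor grows quadratically with depth, the terms rise to a single peak and then fall away quickly, so the untruncated sum needs only a modest number of iterations.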

We observe that the potential gain, as defined in the above equation, differs from
PageRank [BGS05, Ber05], the most studied link analysis metric, in that the discounting
factor $\delta$ gives rise to a double exponential, thus guaranteeing that the effective branching
factor monotonically decreases to zero. Consequently, the portion of the web graph that
is potentially reachable during a session is bounded. In the PageRank model, the effective
branching factor is always greater than one and, consequently, the PageRank depends on the
entire web graph. Moreover, in the PageRank model, the (random) surfer wanders on ad
infinitum, whereas, in the navigation-based model presented here, the length of the surfer's
session is limited by the diminishing branching factor. This allows us to approximate (2) using
the erf function. (We note that the potential gain may be viewed as a generalised ranking
algorithm [BBC06] with a double exponential damping function; this type of damping function
was not considered in [BBC06].)



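Setting $\alpha = \beta\delta^{-1/2}$, $\theta = \delta^{1/2}$ and $\lambda^2 = \ln(1/\theta)$, the
potential gain of U up to depth d, denoted by $PG_d(U)$, is given by

$$PG_d(U) = \sum_{i=0}^{d} \beta^i \delta^{i(i-1)/2} = \sum_{i=0}^{d} \alpha^i \theta^{i^2}. \tag{3}$$

To approximate PG(U), we need to find the greatest depth d such that

$$\alpha^d \theta^{d^2} \ge 1, \quad \text{i.e.} \quad \alpha\theta^{d} \ge 1,$$

since for greater depths the number of available nodes will be less than one; this value of d
corresponds to $\Delta$. Thus

$$\Delta = \left\lfloor \frac{\ln(\alpha)}{\lambda^2} \right\rfloor = \left\lfloor \frac{\ln(\alpha)}{\ln(1/\theta)} \right\rfloor.$$

Now, let

$$N = \frac{\ln(\alpha)}{\lambda^2} = \frac{2\ln(\beta)}{\ln(1/\delta)} + 1, \tag{4}$$

noting that $\Delta = \lfloor N \rfloor$. (Since $\Delta \approx N$, given $\beta$ and $\Delta$, we can
thus derive an approximation to $\delta$.)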

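We claim that $\alpha^x\theta^{x^2}$ attains its maximum at x = N/2. To show this we take its
derivative, obtaining

$$\frac{d}{dx}\left(\alpha^x\theta^{x^2}\right) = \alpha^x\theta^{x^2}\left(\ln(\alpha) + 2x\ln(\theta)\right),$$

which is equal to zero at

$$x = \frac{\ln(\alpha)}{2\ln(1/\theta)} = \frac{\ln(\alpha)}{2\lambda^2} = \frac{N}{2}. \tag{5}$$

It can be verified that the second derivative of $\alpha^x\theta^{x^2}$ at x = N/2 is negative and
thus this function has a maximum at this point.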
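The quantities introduced above can be checked numerically. The Python sketch below derives the reparametrised constants from a branching factor and a discount factor, and evaluates the general term so that its peak and the cutoff depth can be inspected. The function names `session_parameters` and `term` are assumptions made for this illustration.

```python
from math import floor, log, sqrt


def session_parameters(beta, delta):
    """Return (alpha, theta, N, Delta) for branching factor beta and
    discount factor delta, with 0 < delta < 1."""
    alpha = beta / sqrt(delta)   # alpha = beta * delta**(-1/2)
    theta = sqrt(delta)          # theta = delta**(1/2)
    lam2 = log(1.0 / theta)      # lambda**2 = ln(1/theta)
    n_real = log(alpha) / lam2   # N, in general not an integer
    # Delta is the floor of N. When N is an exact integer, floating-point
    # rounding may land just below it, so round with care in that case.
    return alpha, theta, n_real, floor(n_real)


def term(alpha, theta, x):
    """The general term alpha**x * theta**(x**2), for real x."""
    return alpha ** x * theta ** (x * x)
```

With, say, a branching factor of 3 and a discount factor of 0.5, the term at depth 4 is still at least one while the term at depth 5 is below one, in agreement with the floor of N, and evaluating `term` slightly to either side of N/2 gives smaller values than at N/2 itself.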

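We next proceed to find an approximation of (3) by using the Euler-Maclaurin summation
formula [Fro65].

Now $\theta = e^{-\lambda^2}$ and, from (4), $\alpha = e^{\lambda^2 N}$, so

$$\alpha^i\theta^{i^2} = e^{\lambda^2 N i - \lambda^2 i^2} = e^{\lambda^2 N^2/4}\, e^{-\lambda^2 (i - N/2)^2}.$$

Therefore, from (3),

$$PG_d(U) = e^{\lambda^2 N^2/4} \sum_{i=0}^{d} e^{-\lambda^2 (i - N/2)^2} = e^{\lambda^2 N^2/4} \sum_{i=0}^{d} f(i), \tag{6}$$

where

$$f(x) = e^{-\lambda^2 (x - N/2)^2}.$$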



For compactness we let $S_d = e^{-\lambda^2 N^2/4}\, PG_d(U)$. We now bound $S_d$ using the following
version of the Euler-Maclaurin summation formula, truncated after the term involving the
first derivatives (see [Fro65, p. 211]):

$$S_d = \sum_{i=0}^{d} f(i) = \int_0^d f(x)\,dx + \frac{1}{2}\left[f(0) + f(d)\right] + \frac{1}{12}\left[f'(d) - f'(0)\right] - R_d, \tag{7}$$

where the remainder term $R_d$ satisfies

$$R_d = \frac{d}{720}\, f^{(4)}(\xi),$$

for some $\xi$, with $0 < \xi < d$.

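We first consider the definite integral. Making the substitution

$$y = \lambda(x - N/2),$$

we obtain

$$\int_0^d f(x)\,dx = \frac{1}{\lambda} \int_{-\lambda N/2}^{\lambda n/2} e^{-y^2}\,dy,$$

where n = 2d - N.

Expressing this in terms of the well-known error function [AS72, 7.1.1],

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-y^2}\,dy,$$

and using the fact that $e^{-y^2}$ is an even function, we obtain

$$\int_0^d f(x)\,dx = \frac{\sqrt{\pi}}{2\lambda}\left(\operatorname{erf}\left(\frac{\lambda N}{2}\right) + \operatorname{erf}\left(\frac{\lambda n}{2}\right)\right). \tag{8}$$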



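Using the formulae in the Appendix to get an expression for f', we easily obtain the
following expression for the other terms on the right-hand side of (7), apart from the remainder
term $R_d$:

$$\frac{1}{2}\left[f(0) + f(d)\right] + \frac{1}{12}\left[f'(d) - f'(0)\right] = \left(\frac{1}{2} - \frac{\lambda^2 N}{12}\right) e^{-\lambda^2 N^2/4} + \left(\frac{1}{2} - \frac{\lambda^2 n}{12}\right) e^{-\lambda^2 n^2/4}. \tag{9}$$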



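We now turn our attention to the remainder term $R_d$. This satisfies

$$R_d = \frac{\lambda^4 d}{720}\, F^{(4)}(\eta),$$

for some $\eta$, with $-\lambda N/2 < \eta < \lambda n/2$, where $F(\eta) = e^{-\eta^2}$.
Using (14) in the Appendix, this gives

$$-\frac{\lambda^4 d}{96} < R_d \le \frac{\lambda^4 d}{60}. \tag{10}$$

Substituting (8) and (9) into (7), we obtain

$$S_d = \frac{\sqrt{\pi}}{2\lambda}\left(\operatorname{erf}\left(\frac{\lambda N}{2}\right) + \operatorname{erf}\left(\frac{\lambda n}{2}\right)\right) + \left(\frac{1}{2} - \frac{\lambda^2 N}{12}\right) e^{-\lambda^2 N^2/4} + \left(\frac{1}{2} - \frac{\lambda^2 n}{12}\right) e^{-\lambda^2 n^2/4} - R_d. \tag{11}$$

Together with (10), this immediately gives bounds on $PG_d(U)$ since $PG_d(U) = e^{\lambda^2 N^2/4} S_d$.
We may then estimate the total potential gain PG(U) as $PG_\Delta(U)$ by putting $d = \Delta$.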
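The erf-based estimate and its bounds can be evaluated with the Python standard library. The sketch below follows the derivation in this section: it parametrises the model by the branching factor and N, so that $\lambda^2 = \ln(\beta)/(N-1)$, and returns the estimate with the remainder ignored together with the lower and upper bounds implied by the remainder bounds. The function name `pg_erf_estimate` is an assumption made for this illustration.

```python
from math import erf, exp, log, pi, sqrt


def pg_erf_estimate(beta, N, d):
    """Return (no_remainder, lower, upper) for PG_d(U).

    Assumes beta > 1 and N > 1, with delta = beta**(-2/(N-1)) so that
    lambda**2 = (1/2) ln(1/delta) = ln(beta) / (N - 1).
    """
    lam2 = log(beta) / (N - 1)
    lam = sqrt(lam2)
    n = 2 * d - N
    peak = exp(lam2 * N * N / 4)  # the factor multiplying S_d
    # Integral term, expressed through the error function.
    integral = sqrt(pi) / (2 * lam) * (erf(lam * N / 2) + erf(lam * n / 2))
    # End-point and first-derivative terms of Euler-Maclaurin.
    ends = ((0.5 - lam2 * N / 12) * exp(-lam2 * N * N / 4)
            + (0.5 - lam2 * n / 12) * exp(-lam2 * n * n / 4))
    s_no_r = integral + ends
    # S_d = s_no_r - R_d with -lam**4 d/96 < R_d <= lam**4 d/60.
    lower = peak * (s_no_r - lam2 ** 2 * d / 60)
    upper = peak * (s_no_r + lam2 ** 2 * d / 96)
    return peak * s_no_r, lower, upper
```

Calling `pg_erf_estimate(10.0, 10, 10)` should reproduce, up to rounding, the noR, lb and ub entries tabulated for a branching factor of 10 in Section 3, and the directly summed value lies between the two bounds.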

3 Distribution of Potential Gain Values 

In this section we examine some aspects of the potential gain function and the distribution
of its values.

We assume that $\Delta$, the mean number of user clicks per navigation session, is about 10;
this is quite close to 8.32 reported in [HPPL98]. We also assume that the local branching
factor $\beta$ is between 2 and 25; see [DKM+02] for data on branching factors for different subsets
of the web. We note that, in the case when $\beta = 1$, we have $N = \Delta = 1$ and $PG_\Delta(U) = 2$,
implying that there is no choice for the user. In Table 1 we give, for $2 \le \beta \le 25$, various
quantities related to the potential gain. These were computed as follows:

(i) From (4), $\delta = \beta^{-2/(N-1)}$.

(ii) By definition, $\lambda = \left(\frac{1}{2}\ln(1/\delta)\right)^{1/2}$.

(iii) From (5) we know that max, the maximum of $\alpha^x\theta^{x^2}$, is attained for x = N/2; the
maximum value is easily shown to be $\beta^{N^2/(4(N-1))}$.

(iv) $PG_{10}(U)$, our estimate of PG(U), is given by (3). By regression on the plot shown in
Figure 1, we found that this is a good fit to the power-law:

$$PG_{10}(U) \approx 5.162\,\beta^{2.605}.$$

(v) noR is the approximation to $PG_{10}(U)$ obtained from (11) if we set $R_{10}$ to 0.

(vi) The upper (ub) and lower (lb) bounds on $PG_{10}(U)$ are derived from (10) and (11).
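Items (i) to (iii) above amount to three closed-form expressions, which the following Python sketch evaluates for a given branching factor. The function name `table_quantities` and the default of N = 10 are assumptions made for this illustration.

```python
from math import log, sqrt


def table_quantities(beta, N=10):
    """Return (delta, lam, peak) as in items (i) to (iii).

    delta : discount factor implied by beta and N
    lam   : lambda = sqrt((1/2) ln(1/delta))
    peak  : maximum value of the general term, attained at x = N/2
    """
    delta = beta ** (-2.0 / (N - 1))
    lam = sqrt(0.5 * log(1.0 / delta))
    peak = beta ** (N * N / (4.0 * (N - 1)))
    return delta, lam, peak
```

Rounded to two decimal places, the values returned for a branching factor of 2 should match the first row of Table 1.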

Although the average of the lower and upper bounds given in Table 1 yields a reasonably
good approximation to $PG_{10}(U)$, we see that noR, i.e. the approximation obtained if we
ignore the remainder term, is extremely close to the actual value of $PG_{10}(U)$. In Figure 2 we
show a typical plot of the potential gain against the depth d.

 β     δ     λ       max   PG_10(U)        noR         lb         ub  (lb+ub)/2
 2  0.86  0.28      6.86      42.49      42.49      42.49       42.5      42.49
 3  0.78  0.35     21.15     106.65     106.65      106.6     106.68     106.64
 4  0.73  0.39     47.03     211.98     211.97     211.79     212.09     211.94
 5   0.7  0.42     87.41     366.08     366.07      365.6     366.36     365.98
 6  0.67  0.45    145.05     575.98     575.96        575     576.56     575.78
 7  0.65  0.46    222.58     848.26     848.24     846.51     849.32     847.91
 8  0.63  0.48    322.54    1189.17    1189.15    1186.28    1190.94    1188.61
 9  0.61  0.49    447.38     1604.7    1604.67    1600.23    1607.45    1603.84
10   0.6  0.51    599.48    2100.59    2100.55    2094.01    2104.64    2099.33
11  0.59  0.52    781.19    2682.38    2682.34     2673.1    2688.12    2680.61
12  0.58  0.53    994.78    3355.48    3355.43    3342.79    3363.33    3353.06
13  0.57  0.53   1242.47     4125.1    4125.05    4108.23    4135.57     4121.9
14  0.56  0.54   1526.47    4996.36    4996.31    4974.43    5009.98    4992.21
15  0.55  0.55   1848.93    5974.24    5974.18    5946.29    5991.62    5968.95
16  0.54  0.56   2211.96    7063.61    7063.56    7028.57    7085.42       7057
17  0.53  0.56   2617.66    8269.26     8269.2    8225.97    8296.22     8261.1
18  0.53  0.57   3068.09    9595.88    9595.81    9543.07    9628.78    9585.92
19  0.52  0.57   3565.28   11048.06   11047.99   10984.39   11087.74   11036.07
20  0.51  0.58   4111.23   12630.34   12630.27   12554.35   12677.72   12616.03
21  0.51  0.58   4707.94   14347.18    14347.1   14257.31   14403.22   14330.27
22   0.5  0.59   5357.37   16202.96   16202.88   16097.56   16268.71   16183.13
23   0.5  0.59   6061.46   18202.01   18201.93   18079.31   18278.57   18178.94
24  0.49  0.59   6822.13   20348.61   20348.53   20206.75   20437.14   20321.94
25  0.49   0.6   7641.29   22646.97   22646.88   22483.97   22748.69   22616.33

Table 1: Tabulation for $N = \Delta = 10$



4 An Alternative Discounting Factor 

We next look at an alternative discounting factor. We assume that, at distance d from the
starting URL U, the user will only inspect 1/(d + 1) of the currently available links, prior to
following one of them. Thus, given $\beta$, the effective branching factor at depth i is $\beta/(i + 1)$
and the potential number of available nodes at this depth is

$$\frac{\beta}{1} \cdot \frac{\beta}{2} \cdots \frac{\beta}{i} = \frac{\beta^i}{i!}, \tag{12}$$

which corresponds to (1).

The alternative potential gain of U, denoted by APG(U), is now simply the total number
of available nodes at all depths, i.e.

$$APG(U) = \sum_{i=0}^{\infty} \frac{\beta^i}{i!} = e^{\beta}, \tag{13}$$

which corresponds to (2). The alternative potential gain of U up to depth d, denoted by
$APG_d(U)$, is obtained by replacing the upper limit of the sum by d.
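Under harmonic discounting each term is the previous one multiplied by the branching factor and divided by the next depth, so the truncated sum is a partial sum of the exponential series. The Python sketch below makes this concrete; the function name `alternative_potential_gain` is an assumption made for this illustration.

```python
from math import exp


def alternative_potential_gain(beta, depth):
    """Sum beta**i / i! for i = 0, ..., depth."""
    total = 0.0
    term = 1.0  # the term for i = 0
    for i in range(depth + 1):
        total += term
        # The effective branching factor at depth i is beta / (i + 1).
        term *= beta / (i + 1)
    return total


def alternative_potential_gain_limit(beta):
    """Value of the untruncated sum, i.e. the exponential of beta."""
    return exp(beta)
```

Comparing `alternative_potential_gain(beta, depth)` with `alternative_potential_gain_limit(beta)` shows how much of the total is contributed by pages beyond the chosen depth.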
Figure 1: Plot of $PG_{10}(U)$ against $\beta$



As in the case of PG(U), in order to approximate APG(U), we need to find the maximum
depth d such that

$$\frac{\beta^d}{d!} \ge 1.$$

By Stirling's approximation [GKP94], $\ln(d!) \approx d \ln d - d$, so we have (approximately) that
$d \le e\beta$. Thus $\Delta = \lfloor e\beta \rfloor$.

We next consider the maximum term in the sum (13), i.e. APG(U), as we did for PG(U).
It is straightforward to show that the maximum term is at $i = \lfloor \beta \rfloor$. Ignoring rounding
errors and using Stirling's approximation, we see that the maximum term is approximately
$e^{\beta}/(2\pi\beta)^{1/2}$. (When $\beta$ is an integer the maximum is also attained when $i = \beta - 1$.)
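The depth cutoff and the position of the largest term can be explored numerically. In the Python sketch below all function names are assumptions made for this illustration. Because $\ln(d!) \approx d \ln d - d$ underestimates $\ln(d!)$, the Stirling-based cutoff is an upper estimate, and the sketch also computes the exact cutoff for comparison.

```python
from math import e, exp, factorial, floor, pi, sqrt


def apg_term(beta, i):
    """Number of available nodes at depth i: beta**i / i!."""
    return beta ** i / factorial(i)


def stirling_max_depth(beta):
    """Approximate cutoff depth floor(e * beta) from Stirling's formula."""
    return floor(e * beta)


def exact_max_depth(beta):
    """Greatest depth d at which apg_term(beta, d) is still at least 1."""
    d = 0
    while apg_term(beta, d + 1) >= 1:
        d += 1
    return d


def peak_estimate(beta):
    """Stirling estimate of the largest term: e**beta / sqrt(2 pi beta)."""
    return exp(beta) / sqrt(2 * pi * beta)
```

For an integer branching factor such as 5, `apg_term(5, 4)` and `apg_term(5, 5)` coincide, which illustrates the remark that the maximum is attained at two adjacent depths in that case.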

Taking $d = \Delta = 10$ we obtain the branching factor $\beta = 3.6788$. In Figure 3 we show a
typical plot of the alternative potential gain against the depth d. In this case the sum of the
terms in (13) up to depth 10 is 39.54, which is a good approximation of $e^{3.6788} = 39.60$.



5 Concluding Remarks 

We defined a measure of navigability, called the potential gain, that provides a model of user
navigation in the web. This can help the user in an online scenario to choose a starting
URL for navigation, given no other information. One important factor that distinguishes the
potential gain from other link analysis metrics [LM0£], such as Google's PageRank, is that
it measures "hubness", i.e. the accessibility from the page of information on other pages,
rather than authority, i.e. the accessibility from elsewhere of information on the page. (See
also our comment after (2) regarding another important distinction between the potential
gain and PageRank.) Whereas PageRank measures authority, the Hyperlink-Induced Topic
Search (HITS) algorithm [Kle99] identifies both hubs and authorities, but its computation
is query-specific. In this context, it is worth noting that the potential gain is related to the
notion of centrality [Fre7£], which is a fundamental notion in social network analysis [Sco00].

Figure 2: Plot of the distribution of $PG_d(U)$ for N = 10 and $\beta = 10$

The potential gain has been applied in a search and navigation engine that we have
developed. Its distinctive feature is that an answer to a user query suggests several possible
navigation paths that the user can follow [WL03], rather than just individual web pages
as suggested by conventional search engines. As part of the search and navigation engine,
potential gain values are pre-computed for each page in the web site being searched; these
are then used to select good starting URLs for navigation [LW04].



Appendix 

We obtain here the derivatives of f(x). To simplify the calculations, it is convenient to let
$y = \lambda(x - N/2)$ and define

$$F(y) = e^{-y^2} = f(x).$$

The derivatives of f(x) are determined from the derivatives of F(y), since

$$f^{(k)}(x) = \lambda^k F^{(k)}(y).$$

By straightforward differentiation we obtain:

$$\begin{aligned}
F'(y) &= -2y\, e^{-y^2}, \\
F''(y) &= (4y^2 - 2)\, e^{-y^2}, \\
F'''(y) &= (-8y^3 + 12y)\, e^{-y^2}, \\
F^{(4)}(y) &= (16y^4 - 48y^2 + 12)\, e^{-y^2}, \\
F^{(5)}(y) &= (-32y^5 + 160y^3 - 120y)\, e^{-y^2}.
\end{aligned}$$

Figure 3: Plot of the distribution of $APG_d(U)$ for $\Delta = 10$ and $\beta = 3.6788$



These functions are closely related to the Hermite polynomials [Fro65, p. 189].

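We also require the extreme values of $F^{(4)}(y)$. Using a straightforward calculation, it is
readily verified that the local extrema of this function are

$$\begin{aligned}
F^{(4)}(y) &= 12 && \text{at } y = 0, \\
F^{(4)}(y) &\approx -7.42 && \text{at } y = \pm\left((5 - \sqrt{10})/2\right)^{1/2}, \\
F^{(4)}(y) &\approx 1.39 && \text{at } y = \pm\left((5 + \sqrt{10})/2\right)^{1/2}.
\end{aligned}$$

Thus

$$-7.5 < F^{(4)}(y) \le 12. \tag{14}$$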
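The extreme values quoted above can be confirmed numerically. The Python sketch below evaluates the fourth derivative at the critical points, which are the zeros of the fifth derivative, and the names `F4`, `critical_points` and `values` are assumptions made for this illustration.

```python
from math import exp, sqrt


def F4(y):
    """Fourth derivative of exp(-y**2)."""
    return (16 * y ** 4 - 48 * y ** 2 + 12) * exp(-y * y)


# Non-negative zeros of the fifth derivative: y = 0 and the roots of
# 4 y**4 - 20 y**2 + 15 = 0, i.e. y**2 = (5 -/+ sqrt(10)) / 2.
# The negative roots give the same values because F4 is even.
critical_points = [
    0.0,
    sqrt((5 - sqrt(10)) / 2),
    sqrt((5 + sqrt(10)) / 2),
]
values = [F4(y) for y in critical_points]
```

Since the fourth derivative tends to zero as y grows in magnitude, its global maximum and minimum are among these critical values, which is what the bounds in (14) rely on.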